⚡ Bolt: [performance improvement] Optimize keyword delimiter matching by bashandbone · Pull Request #281 · knitli/codeweaver

bashandbone · 2026-03-27T12:40:07Z

💡 What:
The _match_keyword_delimiters method in the DelimiterChunker has been optimized. Instead of looping through each potential keyword delimiter and running a separate re.finditer search across the entire document text, the code now combines all keyword start strings into a single compiled regex pattern using alternation (?:...) and maps matched text back to the delimiter objects.

🎯 Why:
Iterating re.finditer for each keyword results in $O(K \times N)$ regex evaluation time (where $K$ is the number of keywords and $N$ is the size of the text). By compiling them into a single pattern, the algorithm performs a single $O(N)$ pass, avoiding redundant full-text scans and context switching. An added guard clause ensures the optimized pattern does not compile an empty expression if the keyword_delimiters list is unexpectedly empty.

📊 Impact:
Significantly reduces parsing overhead for code chunking, especially on larger files with many potential keyword delimiters. Micro-benchmarking confirms the single-pass combined regex pattern reduces execution time compared to individual regex evaluation loops in extreme scenarios.

🔬 Measurement:
To verify the improvement, run mise //:test tests/unit/engine/chunker/test_delimiter_patterns.py to ensure matching behavior remains precisely equivalent while leveraging the new optimized compiled regex block.

PR created automatically by Jules for task 7248057790755398651 started by @bashandbone

Summary by Sourcery

Optimize keyword delimiter matching in the delimiter chunker for improved performance while preserving existing behavior.

Enhancements:

Replace per-keyword regex scanning with a single combined regex over all keyword delimiters to reduce parsing overhead.
Add a guard clause to skip keyword matching when no keyword delimiters are present.
Correct a parameter docstring to use standard American spelling for "normalized".

This commit modifies `_match_keyword_delimiters` in `src/codeweaver/engine/chunker/delimiter.py` to significantly improve chunking performance. Instead of calling `re.finditer` for every individual keyword delimiter, the optimization combines all start strings into a single compiled regex pattern. This reduces regex execution overhead and limits the algorithm to making a single pass over the content. Additionally, an early return checks for empty lists to prevent compiling a dangerous empty regex. Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>

google-labs-jules · 2026-03-27T12:40:08Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

sourcery-ai · 2026-03-27T12:40:13Z

Reviewer's Guide

Optimizes keyword delimiter matching in DelimiterChunker by replacing per-keyword regex scans with a single combined compiled regex pass, adding a guard for empty keyword sets, and fixing a minor docstring typo.

Sequence diagram for optimized keyword delimiter matching

sequenceDiagram
    actor Caller
    participant DelimiterChunker
    participant RegexEngine
    participant DelimiterMatchList as Matches

    activate DelimiterChunker
    Caller->>DelimiterChunker: _match_keyword_delimiters(content, keyword_delimiters, matches)
    DelimiterChunker->>DelimiterChunker: filter keyword_delimiters with nonempty start
    DelimiterChunker->>Caller: return matches if keyword_delimiters is empty
    deactivate DelimiterChunker

    activate DelimiterChunker
    DelimiterChunker->>DelimiterChunker: start_strings = [d.start for d in keyword_delimiters]
    DelimiterChunker->>RegexEngine: compile combined_pattern with alternation
    RegexEngine-->>DelimiterChunker: combined_pattern
    DelimiterChunker->>DelimiterChunker: delimiter_map = {d.start: d}

    loop for each match in combined_pattern.finditer(content)
        DelimiterChunker->>RegexEngine: combined_pattern.finditer(content)
        RegexEngine-->>DelimiterChunker: match
        DelimiterChunker->>DelimiterChunker: matched_text = match.group(0)
        DelimiterChunker->>DelimiterChunker: delimiter = delimiter_map[matched_text]
        DelimiterChunker->>DelimiterChunker: keyword_pos = match.start()
        DelimiterChunker->>DelimiterChunker: _is_inside_string_or_comment(content, keyword_pos)
        alt inside string or comment
            DelimiterChunker->>DelimiterChunker: continue to next match
        else outside string and comment
            DelimiterChunker->>DelimiterChunker: _find_next_structural_with_char(content, start, allowed)
            DelimiterChunker-->>DelimiterChunker: struct_start, struct_char
            alt no structural opening
                DelimiterChunker->>DelimiterChunker: continue to next match
            else structural opening found
                DelimiterChunker->>DelimiterChunker: _find_matching_close(content, struct_start, struct_char, closing_char)
                DelimiterChunker-->>DelimiterChunker: struct_end
                alt struct_end is not None
                    DelimiterChunker->>DelimiterChunker: nesting_level = _calculate_nesting_level(content, keyword_pos)
                    DelimiterChunker->>Matches: append DelimiterMatch(delimiter, keyword_pos, struct_end, nesting_level)
                else struct_end is None
                    DelimiterChunker->>DelimiterChunker: continue to next match
                end
            end
        end
    end

    DelimiterChunker-->>Caller: matches

Updated class diagram for DelimiterChunker keyword delimiter matching

classDiagram
    class DelimiterChunker {
        +_match_keyword_delimiters(content: str, keyword_delimiters: list[Delimiter], matches: list[DelimiterMatch]) list[DelimiterMatch]
        -_is_inside_string_or_comment(content: str, position: int) bool
        -_find_next_structural_with_char(content: str, start: int, allowed: set[str]) tuple[int,str]
        -_find_matching_close(content: str, struct_start: int, struct_char: str, closing_char: str) int
        -_calculate_nesting_level(content: str, position: int) int
    }

    class Delimiter {
        +start: str
        +end: str
        +type: str
    }

    class DelimiterMatch {
        +delimiter: Delimiter
        +start_pos: int
        +end_pos: int
        +nesting_level: int
    }

    DelimiterChunker ..> Delimiter : uses
    DelimiterChunker ..> DelimiterMatch : creates

    note for DelimiterChunker "_match_keyword_delimiters now uses a single combined regex pattern and delimiter_map for keyword matching"

File-Level Changes

Change	Details	Files
Optimize keyword delimiter matching by using a single combined regex pattern instead of iterating per keyword.	Add guard clause to early-return when the filtered keyword_delimiters list is empty. Build a combined regex pattern using alternation over all keyword start strings with word boundaries and compile it once before scanning the content. Create a mapping from keyword start strings to their corresponding delimiter objects. Replace the outer loop over keyword_delimiters and per-keyword re.finditer calls with a single loop over combined_pattern.finditer, reusing the existing structural and nesting computations for each match.	`src/codeweaver/engine/chunker/delimiter.py`
Minor documentation cleanup in delimiter loading helper.	Correct spelling in _load_custom_delimiters docstring from 'normalised' to 'normalized' for the normalized_language parameter description.	`src/codeweaver/engine/chunker/delimiter.py`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

sourcery-ai

Hey - I've found 2 issues, and left some high level feedback:

The new delimiter_map = {d.start: d ...} assumes keyword start strings are unique; if multiple delimiters can share the same start token, this will silently drop all but one, whereas the previous per-delimiter loop would have produced matches for each.
The combined regex changes the ordering of matches from grouped-by-delimiter to strictly text order; if any downstream logic relies on the previous grouping semantics, consider either preserving the old order or documenting the change explicitly.
You can slightly simplify/optimize by computing allowed = set(structural_pairs.keys()) once before the regex loop and reusing it in _find_next_structural_with_char instead of recreating the set on every match.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The new `delimiter_map = {d.start: d ...}` assumes keyword `start` strings are unique; if multiple delimiters can share the same start token, this will silently drop all but one, whereas the previous per-delimiter loop would have produced matches for each.
- The combined regex changes the ordering of `matches` from grouped-by-delimiter to strictly text order; if any downstream logic relies on the previous grouping semantics, consider either preserving the old order or documenting the change explicitly.
- You can slightly simplify/optimize by computing `allowed = set(structural_pairs.keys())` once before the regex loop and reusing it in `_find_next_structural_with_char` instead of recreating the set on every match.

## Individual Comments

### Comment 1
<location path="src/codeweaver/engine/chunker/delimiter.py" line_range="472" />
<code_context>
-            for match in re.finditer(pattern, content):
-                keyword_pos = match.start()
+        # Create a mapping to quickly look up the delimiter by its matched start string
+        delimiter_map = {d.start: d for d in keyword_delimiters}

-                # Skip if keyword is inside a string or comment
</code_context>
<issue_to_address>
**issue (bug_risk):** Handling duplicate keyword start strings may change behavior compared to the previous implementation.

Previously, each delimiter was processed independently with `re.finditer`, so duplicate `start` values would each be applied (possibly with different end behavior/metadata). With `delimiter_map = {d.start: d for d in keyword_delimiters}`, only the last delimiter for a given `start` is ever used. If duplicate `start` values are valid in this context, this changes semantics. Consider either enforcing unique `start` values up front and failing fast, or mapping each `start` to a list of delimiters to preserve the prior behavior.
</issue_to_address>

### Comment 2
<location path="src/codeweaver/engine/chunker/delimiter.py" line_range="477" />
<code_context>
-                struct_start, struct_char = self._find_next_structural_with_char(
-                    content,
-                    start=keyword_pos + len(delimiter.start),
-                    allowed=set(structural_pairs.keys()),
-                )
+            # Skip if keyword is inside a string or comment
</code_context>
<issue_to_address>
**suggestion (performance):** The `allowed` set is rebuilt on every match and could be hoisted out of the loop.

This recreates `set(structural_pairs.keys())` for every keyword occurrence. Since `structural_pairs` is constant in this method, compute `allowed_structurals = set(structural_pairs.keys())` once before `for match in combined_pattern.finditer(content):` and pass `allowed=allowed_structurals` here to avoid repeated allocation in the hot inner loop.

Suggested implementation:

```python
        allowed_structurals = set(structural_pairs.keys())

        for match in combined_pattern.finditer(content):
            matched_text = match.group(0)
            delimiter = delimiter_map[matched_text]
            keyword_pos = match.start()

```

Within the same method where this loop lives, find the call(s) to:
`self._find_next_structural_with_char(...)` that currently pass `allowed=set(structural_pairs.keys())` and change those arguments to `allowed=allowed_structurals` to reuse the precomputed set.
If there are multiple such calls (e.g., for different delimiters or branches), ensure all are updated to use `allowed_structurals`.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

sourcery-ai · 2026-03-27T12:41:31Z

src/codeweaver/engine/chunker/delimiter.py

-            for match in re.finditer(pattern, content):
-                keyword_pos = match.start()
+        # Create a mapping to quickly look up the delimiter by its matched start string
+        delimiter_map = {d.start: d for d in keyword_delimiters}


issue (bug_risk): Handling duplicate keyword start strings may change behavior compared to the previous implementation.

Previously, each delimiter was processed independently with re.finditer, so duplicate start values would each be applied (possibly with different end behavior/metadata). With delimiter_map = {d.start: d for d in keyword_delimiters}, only the last delimiter for a given start is ever used. If duplicate start values are valid in this context, this changes semantics. Consider either enforcing unique start values up front and failing fast, or mapping each start to a list of delimiters to preserve the prior behavior.

sourcery-ai · 2026-03-27T12:41:31Z

src/codeweaver/engine/chunker/delimiter.py

+        for match in combined_pattern.finditer(content):
+            matched_text = match.group(0)
+            delimiter = delimiter_map[matched_text]
+            keyword_pos = match.start()


suggestion (performance): The allowed set is rebuilt on every match and could be hoisted out of the loop.

This recreates set(structural_pairs.keys()) for every keyword occurrence. Since structural_pairs is constant in this method, compute allowed_structurals = set(structural_pairs.keys()) once before for match in combined_pattern.finditer(content): and pass allowed=allowed_structurals here to avoid repeated allocation in the hot inner loop.

Suggested implementation:

allowed_structurals = set(structural_pairs.keys()) for match in combined_pattern.finditer(content): matched_text = match.group(0) delimiter = delimiter_map[matched_text] keyword_pos = match.start()

Within the same method where this loop lives, find the call(s) to:
self._find_next_structural_with_char(...) that currently pass allowed=set(structural_pairs.keys()) and change those arguments to allowed=allowed_structurals to reuse the precomputed set.
If there are multiple such calls (e.g., for different delimiters or branches), ensure all are updated to use allowed_structurals.

Copilot

Pull request overview

Optimizes DelimiterChunker._match_keyword_delimiters by scanning the input text once with a combined keyword regex instead of running re.finditer per keyword, aiming to reduce chunking overhead on large files.

Changes:

Combine all keyword-start strings into a single compiled regex and map matches back to delimiters.
Add a guard to avoid compiling an empty keyword regex when keyword delimiters are filtered down to none.
Fix a docstring spelling (“normalised” → “normalized”).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-27T12:45:01Z

src/codeweaver/engine/chunker/delimiter.py

+        start_strings = [d.start for d in keyword_delimiters]
+        combined_pattern = re.compile(rf"\b(?:{'|'.join(map(re.escape, start_strings))})\b")

-            for match in re.finditer(pattern, content):
-                keyword_pos = match.start()
+        # Create a mapping to quickly look up the delimiter by its matched start string
+        delimiter_map = {d.start: d for d in keyword_delimiters}

-                # Skip if keyword is inside a string or comment
-                if self._is_inside_string_or_comment(content, keyword_pos):
-                    continue
+        for match in combined_pattern.finditer(content):
+            matched_text = match.group(0)
+            delimiter = delimiter_map[matched_text]
+            keyword_pos = match.start()


delimiter_map = {d.start: d ...} assumes keyword starts are unique. In this codebase they are not (e.g., STRUCT_PATTERN and TYPE_ALIAS_PATTERN both include start string "type" via families.py), so this change will drop one delimiter and change which kind/priority wins during overlap resolution. Consider mapping start -> list[Delimiter] (or selecting the highest-priority delimiter per start) and emitting matches for each delimiter associated with the matched start string to preserve prior behavior.

Copilot · 2026-03-27T12:45:01Z

src/codeweaver/engine/chunker/delimiter.py

+        # Optimization: Combine all keyword start strings into a single compiled regex pattern.
+        # This allows us to make a single pass over the content rather than iterating over
+        # `re.finditer` for each keyword delimiter individually, significantly reducing overhead.
+        start_strings = [d.start for d in keyword_delimiters]
+        combined_pattern = re.compile(rf"\b(?:{'|'.join(map(re.escape, start_strings))})\b")

-            for match in re.finditer(pattern, content):
-                keyword_pos = match.start()
+        # Create a mapping to quickly look up the delimiter by its matched start string
+        delimiter_map = {d.start: d for d in keyword_delimiters}


This optimization changes how keyword delimiters are associated to matches. There are currently unit tests for delimiter chunking behavior, but none appear to cover the case where multiple keyword delimiters share the same start (e.g., type being both STRUCT and TYPE_ALIAS in family patterns). Adding a test that asserts the resolved boundary kind/priority for such an input would help prevent regressions here.

github-actions · 2026-03-27T12:45:15Z

Code Review - PR #281: Optimize keyword delimiter matching

Summary

This PR implements a valuable performance optimization by consolidating multiple regex searches into a single pass. The core idea is sound and addresses a real O(K × N) → O(N) performance improvement. However, there are a few considerations to address.

✅ Strengths

Excellent optimization approach: Combining all keyword patterns into a single regex with alternation ((?:pattern1|pattern2|...)) is the correct approach for reducing redundant text scanning.
Good documentation: The inline comments clearly explain the optimization rationale and approach.
Correctness preserved: The logic flow remains identical to the original implementation - the optimization is purely algorithmic, not behavioral.
Early exit guard: The added early return for empty keyword_delimiters list (line 454-455) prevents potential issues with compiling an empty regex pattern.
Minor fix: The typo correction in the docstring (normalised → normalized) improves consistency.

🔍 Issues & Suggestions

1. Performance issue: Set allocation in hot loop ⚠️

Location: src/codeweaver/engine/chunker/delimiter.py:487

The suggested optimization in the PR comment is valid and should be implemented. Currently, set(structural_pairs.keys()) is being recreated on every iteration of the match loop, which partially negates the performance benefits of the optimization.

Current code:

for match in combined_pattern.finditer(content):
    # ... 
    struct_start, struct_char = self._find_next_structural_with_char(
        content,
        start=keyword_pos + len(delimiter.start),
        allowed=set(structural_pairs.keys()),  # ❌ Recreated every iteration
    )

Recommended fix:

# Define structural delimiters that can complete keywords
structural_pairs = {
    "{": "}",
    ":": "\n",
    "=>": "",
}
allowed_structurals = set(structural_pairs.keys())  # ✅ Compute once

# ... pattern compilation ...

for match in combined_pattern.finditer(content):
    # ...
    struct_start, struct_char = self._find_next_structural_with_char(
        content,
        start=keyword_pos + len(delimiter.start),
        allowed=allowed_structurals,  # ✅ Reuse
    )

2. Edge case: Duplicate keyword starts

Severity: Low (likely not an issue in practice)

The delimiter_map dictionary uses d.start as the key:

delimiter_map = {d.start: d for d in keyword_delimiters}

If multiple Delimiter objects share the same .start string but differ in other attributes (e.g., kind, priority, end), only the last one will be retained in the map due to dictionary key collision.

Questions:

Is it possible for keyword_delimiters to contain multiple delimiters with the same start string?
If so, should all such delimiters be matched, or is taking the last one acceptable?

If this is a concern, consider using a dict mapping to a list:

from collections import defaultdict
delimiter_map = defaultdict(list)
for d in keyword_delimiters:
    delimiter_map[d.start].append(d)

# Then in the loop:
for match in combined_pattern.finditer(content):
    matched_text = match.group(0)
    for delimiter in delimiter_map[matched_text]:
        # ... process each delimiter ...

However, this may be over-engineering if the codebase guarantees unique keyword starts.

🧪 Testing

The PR description mentions running:

mise //:test tests/unit/engine/chunker/test_delimiter_patterns.py

Recommendations:

Ensure tests cover cases with multiple keyword delimiters
Verify edge cases like empty delimiter lists (already handled)
Consider adding a performance benchmark test to validate the O(N) improvement

📊 Overall Assessment

Approval status: ⏸️ Conditional approval

This is a solid optimization that significantly improves performance for code with many keyword delimiters. The approach is correct and the code is well-documented.

Required changes:

✅ Must fix: Hoist set(structural_pairs.keys()) out of the loop (as suggested in comment #3000778585)

Optional improvements:
2. Consider edge case handling for duplicate keyword starts (if applicable)
3. Verify test coverage for the optimization

Once the set allocation issue is addressed, this PR will be ready to merge.

🎯 Recommendation

Please apply the suggested fix from comment #3000778585 to complete the optimization. The change is straightforward and will ensure the performance benefits are fully realized.

Great work on identifying and implementing this optimization! 🚀

github-actions · 2026-03-27T12:46:42Z

Review of PR #281: Optimize keyword delimiter matching

Thanks for this performance optimization! The approach of combining keyword delimiters into a single regex pattern is sound and should indeed provide O(N) performance instead of O(K × N). However, I've identified a critical correctness bug that needs to be addressed before merging.

🐛 Critical Bug: Dictionary Mapping Loses Duplicate Start Strings

Issue Location: src/codeweaver/engine/chunker/delimiter.py:472

The new implementation uses a dictionary to map start strings to delimiter objects:

```python
delimiter_map = {d.start: d for d in keyword_delimiters}
```

Problem: When multiple delimiters have the same `start` string but different semantic properties (kind, priority), the dictionary only keeps the last one, silently dropping the others.

Real-world example: The codebase defines:

`STRUCT_PATTERN` with starts `["struct", "type"]` (kind: STRUCT)
`TYPE_ALIAS_PATTERN` with starts `["type", "typedef", "typealias"]` (kind: TYPE_ALIAS)

When the code encounters the keyword `type`, the old implementation would:

✅ Create matches for BOTH STRUCT and TYPE_ALIAS delimiters
✅ Allow the priority system to resolve which interpretation is correct

The new implementation will:

❌ Only use ONE of the two delimiters (whichever is last in the list)
❌ Miss valid matches for the other delimiter type

🔧 Suggested Fix

Replace the simple dictionary with a multi-value mapping:

```python

Create a mapping to look up ALL delimiters by their matched start string

from collections import defaultdict
delimiter_map = defaultdict(list)
for d in keyword_delimiters:
delimiter_map[d.start].append(d)
```

Then update the matching loop to iterate through all matching delimiters:

```python
for match in combined_pattern.finditer(content):
matched_text = match.group(0)
matching_delimiters = delimiter_map[matched_text] # Get all delimiters with this start

for delimiter in matching_delimiters:  # Process each one
    keyword_pos = match.start()
    
    # Skip if keyword is inside a string or comment
    if self._is_inside_string_or_comment(content, keyword_pos):
        continue
    
    # ... rest of the logic remains the same

```

✅ What's Good

Performance optimization is well-motivated - The O(K × N) → O(N) improvement is significant for large files
Guard clause for empty list - Good defensive programming (line 454)
Code is well-documented - Clear comments explaining the optimization
Maintains single regex compilation - Efficient approach

📋 Additional Recommendations

Add a test case that verifies behavior when multiple delimiters share the same start string (e.g., test "type" keyword matching both STRUCT and TYPE_ALIAS)
Consider deduplication - The `start_strings` list at line 468 could have duplicates. Use `list(set(...))` or `list(dict.fromkeys(...))` to deduplicate while preserving order
Minor: The docstring typo fix (`normalised` → `normalized`) at line 1293 is good, but unrelated to this PR

🎯 Verdict

Request Changes - The correctness bug must be fixed before merging. The optimization approach is excellent, but the implementation needs to handle multiple delimiters per start string correctly.

Test Coverage: The existing test file `tests/unit/engine/chunker/test_delimiter_patterns.py` doesn't appear to have specific tests for `_match_keyword_delimiters` behavior. Consider adding integration tests that verify "type" keyword matches both STRUCT and TYPE_ALIAS patterns.

github-actions · 2026-03-27T12:49:47Z

Code Review

I've reviewed this PR and identified a critical bug in the optimization that needs to be addressed before merging.

🐛 Critical Issue: Duplicate Start String Handling

Location: src/codeweaver/engine/chunker/delimiter.py:472

The optimization introduces a bug when multiple keyword delimiters share the same start string. The current implementation creates a dictionary mapping:

delimiter_map = {d.start: d for d in keyword_delimiters}

Problem: When multiple delimiters have the same start value, only the last one is retained in the dictionary, causing earlier delimiters to be lost.

Concrete Example:
Looking at src/codeweaver/engine/chunker/delimiters/patterns.py:

Line 146: STRUCT_PATTERN = DelimiterPattern(starts=["struct", "type"], ...) with kind DelimiterKind.STRUCT
Line 161: TYPE_ALIAS_PATTERN = DelimiterPattern(starts=["type", "typedef", "typealias"], ...) with kind DelimiterKind.TYPE_ALIAS

Both patterns include "type" as a start keyword but with different kinds (STRUCT vs TYPE_ALIAS) and potentially different priorities. With the current optimization, when the code encounters the keyword type, only one of these delimiters will be considered—whichever was processed last when building the map.

🔧 Suggested Fix

The delimiter map needs to support multiple delimiters per start string. Here are two approaches:

Option 1: Map to list of delimiters

from collections import defaultdict

delimiter_map = defaultdict(list)
for d in keyword_delimiters:
    delimiter_map[d.start].append(d)

for match in combined_pattern.finditer(content):
    matched_text = match.group(0)
    matching_delimiters = delimiter_map[matched_text]
    
    for delimiter in matching_delimiters:
        keyword_pos = match.start()
        # ... rest of the matching logic

Option 2: Include all duplicates in regex and match all
Keep the combined pattern but process all delimiters that match the found text:

for match in combined_pattern.finditer(content):
    matched_text = match.group(0)
    matching_delimiters = [d for d in keyword_delimiters if d.start == matched_text]
    
    for delimiter in matching_delimiters:
        # ... matching logic

Option 1 is more efficient as it pre-builds the lookup structure.

✅ Positive Aspects

Performance optimization is sound in principle - combining regex patterns into a single pass is a valid optimization technique
Good documentation - the inline comments clearly explain the optimization intent
Guard clause added - the early return for empty keyword_delimiters is a good addition
Spelling fix - correcting "normalised" to "normalized" in the docstring (line 1293)

📋 Testing Recommendation

As mentioned in the Copilot review comment, this PR should include a test that validates behavior when multiple keyword delimiters share the same start string. Suggested test case:

def test_keyword_delimiters_with_duplicate_starts():
    """Test that multiple delimiters with same start string are all matched."""
    # Test code that uses 'type' keyword which appears in both STRUCT and TYPE_ALIAS patterns
    code = """
    type MyStruct struct {
        field string
    }
    
    type MyAlias = string
    """
    
    # Verify that both STRUCT and TYPE_ALIAS boundaries are detected
    # and resolved based on context/priority

🎯 Recommendation

Request Changes - The bug must be fixed before merging. Once the duplicate start handling is corrected and appropriate tests are added, this optimization will be a valuable improvement to the codebase.

📊 Impact Assessment

Correctness: ❌ Critical bug affecting delimiter matching
Performance: ✅ Optimization approach is sound
Test Coverage: ⚠️ Missing tests for edge case
Code Quality: ✅ Well-documented, clean code style
Security: ✅ No security concerns

Next Steps:

Fix the delimiter_map to handle multiple delimiters with the same start string
Add test coverage for this edge case
Verify all existing tests still pass
Re-run performance benchmarks to confirm optimization still delivers expected gains

github-actions · 2026-03-27T12:53:50Z

Code Review

I've reviewed this PR and found a critical bug in the optimization that will cause incorrect behavior when there are multiple keyword delimiters with the same start string.

Issue: Dictionary Comprehension Loses Duplicate Start Strings

On line 472 in src/codeweaver/engine/chunker/delimiter.py:

delimiter_map = {d.start: d for d in keyword_delimiters}

This dictionary comprehension will silently drop all but the last delimiter when multiple delimiters share the same start string. For example, if you have:

Delimiter 1: start="function", end="{", priority=70
Delimiter 2: start="function", end="=>", priority=65

The map will only contain Delimiter 2, and Delimiter 1 will be completely lost. This breaks the original behavior where each delimiter was processed independently.

The Fix

To maintain correctness while keeping the performance optimization, you need to use a multi-value map. Here's the corrected approach:

# Create a mapping to quickly look up delimiters by their matched start string
# Use a list to handle multiple delimiters with the same start string
from collections import defaultdict
delimiter_map: dict[str, list[Delimiter]] = defaultdict(list)
for d in keyword_delimiters:
    delimiter_map[d.start].append(d)

for match in combined_pattern.finditer(content):
    matched_text = match.group(0)
    keyword_pos = match.start()
    
    # Process ALL delimiters that match this start string
    for delimiter in delimiter_map[matched_text]:
        # Skip if keyword is inside a string or comment
        if self._is_inside_string_or_comment(content, keyword_pos):
            continue
        
        # ... rest of the logic

Additional Observations

Positives:
✅ The core optimization idea is sound - combining regexes into a single pattern is the right approach
✅ Code is well-commented explaining the optimization rationale
✅ Guard clause for empty keyword_delimiters is good
✅ Minor fix: corrected spelling "normalized" (line 1293)

Concerns:
⚠️ The experimental Python 3.13t and 3.14t tests are failing (though these are experimental free-threaded builds)
⚠️ CI logs aren't available yet to determine if failures are related to this change

Recommendation

Request Changes - The duplicate key issue must be fixed before merging. The optimization is valuable, but correctness cannot be compromised.

Once fixed, this will be an excellent performance improvement for keyword delimiter matching.

aiedwardyi · 2026-03-28T08:17:53Z

I pulled this down locally to address the duplicate-keyword-start bug from the review. I don't have push access to bolt-optimize-chunker-delimiter-7248057790755398651, so I opened #282 with the minimal fix plus a regression test.

The change preserves all delimiters that share the same keyword start instead of collapsing them to the last map entry, which restores the original per-delimiter behavior while keeping the single-pass regex optimization. If you'd rather keep everything on #281, feel free to cherry-pick 196c3ad5.

bashandbone · 2026-03-28T20:49:06Z

closing in favor of #282

bashandbone · 2026-03-28T20:49:31Z

closing in favor of #282

google-labs-jules · 2026-03-28T20:51:26Z

closing in favor of #282

Understood. Acknowledging that this work is now obsolete and stopping work on this task.

google-labs-jules · 2026-03-28T20:51:27Z

closing in favor of #282

Understood. Acknowledging that this work is now obsolete and stopping work on this task.

Copilot AI review requested due to automatic review settings March 27, 2026 12:40

Copilot started reviewing on behalf of bashandbone March 27, 2026 12:40 View session

sourcery-ai bot reviewed Mar 27, 2026

View reviewed changes

Copilot AI reviewed Mar 27, 2026

View reviewed changes

aiedwardyi mentioned this pull request Mar 28, 2026

fix: preserve duplicate keyword delimiter matches #282

Merged

bashandbone closed this Mar 28, 2026

bashandbone reopened this Mar 28, 2026

bashandbone closed this Mar 28, 2026

Conversation

bashandbone commented Mar 27, 2026 • edited by sourcery-ai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by Sourcery

Uh oh!

google-labs-jules bot commented Mar 27, 2026

Uh oh!

sourcery-ai bot commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

Sequence diagram for optimized keyword delimiter matching

Updated class diagram for DelimiterChunker keyword delimiter matching

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Mar 27, 2026

Code Review - PR #281: Optimize keyword delimiter matching

Summary

✅ Strengths

🔍 Issues & Suggestions

1. Performance issue: Set allocation in hot loop ⚠️

2. Edge case: Duplicate keyword starts

🧪 Testing

📊 Overall Assessment

🎯 Recommendation

Uh oh!

github-actions bot commented Mar 27, 2026

Review of PR #281: Optimize keyword delimiter matching

🐛 Critical Bug: Dictionary Mapping Loses Duplicate Start Strings

🔧 Suggested Fix

Create a mapping to look up ALL delimiters by their matched start string

✅ What's Good

📋 Additional Recommendations

🎯 Verdict

Uh oh!

github-actions bot commented Mar 27, 2026

Code Review

🐛 Critical Issue: Duplicate Start String Handling

🔧 Suggested Fix

✅ Positive Aspects

📋 Testing Recommendation

🎯 Recommendation

📊 Impact Assessment

Uh oh!

github-actions bot commented Mar 27, 2026

Code Review

Issue: Dictionary Comprehension Loses Duplicate Start Strings

The Fix

Additional Observations

Recommendation

Uh oh!

aiedwardyi commented Mar 28, 2026

Uh oh!

bashandbone commented Mar 28, 2026

Uh oh!

bashandbone commented Mar 28, 2026

Uh oh!

google-labs-jules bot commented Mar 28, 2026

Uh oh!

bashandbone commented Mar 27, 2026 •

edited by sourcery-ai bot

Loading

sourcery-ai bot commented Mar 27, 2026 •

edited

Loading